1 Prompt

Dear Candidate,

Thank you for spending your valuable time with us during the Logitech interview. We are happy to announce that you have been shortlisted for the next phase of our hiring.

Attached below is some simple mock-up data. The intention is a simple analytical exercise that enables us to assess your coding and analytical skills.

You are not expected to spend more than 2 hours on this assignment.

What we are looking for:

  1. Ability to visualize and provide insights on the data provided.
  2. Ability to code, wrangle data in the language of your preference (it can be either R or Python, R preferred).
  3. Ability to work with unclean data.
  4. Maturity in coding style.
  5. Critique solution and offer next steps, if any.

1.1 Task Overview

  1. Data exploration: Look into the data and tell us about your findings. For instance, one finding might be around yearly growth rates, but it’s up to you to identify interesting information. Graphs that complement this are welcomed.
  2. Forecasting: Please create a 12-month forecast for some of the most important product categories. You are again free to choose one or several forecast algorithms.

1.2 Data File Details

  • The attached csv file contains monthly sales data for various product categories/regions. The data is not from Logitech but an external data source and randomly modified. We still require that you treat this information confidentially and do not share it with anyone. Please delete the data once the recruitment process is terminated.
  • You can freely choose the technical tools you use for the analysis, with a preference for R.
  • The data is almost but not entirely clean, so please check data quality first.

1.3 Submission

  1. A summary document containing your findings for the data exploration, charts of your forecasts and a description of your forecast method(s).
  2. The various scripts you used for the assignment.
  3. A csv file containing the forecast data.

2 EDA

2.1 Initial EDA

(Code and output hidden; see .rmd for code)
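
The hidden chunk is not shown in full; a minimal sketch that would reproduce the output below (the file name `sales_data.csv` is a placeholder, and readr/dplyr are assumed):

library(readr)
library(dplyr)

raw_data <- read_csv("sales_data.csv")    # placeholder file name

head(raw_data, 15)                        # preview of the raw, wide-format data
glimpse(raw_data)                         # structure: 282 rows x 72 columns
colSums(is.na(raw_data))                  # NA count per column
sum(is.na(raw_data))                      # total NA cells
raw_data %>%                              # combination counts
  count(Category1, Category2, Category3, sort = TRUE)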

(Raw data preview, first 15 rows of 282 × 72 columns: `Category1`, `Category2`, `Category3`, then monthly columns `10-Dec` through `16-Aug`. Every data row is surrounded by all-NA separator rows; the first data rows shown are the combinations A/X/W, A/A/A, A/A/B, and A/A/C.)
Rows: 282
Columns: 72
$ Category1 <chr> NA, NA, "A", NA, NA, "A", NA, NA, "A", NA, NA, "A", NA, NA, …
$ Category2 <chr> NA, NA, "X", NA, NA, "A", NA, NA, "A", NA, NA, "A", NA, NA, …
$ Category3 <chr> NA, NA, "W", NA, NA, "A", NA, NA, "B", NA, NA, "C", NA, NA, …
$ `10-Dec`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Jan`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Feb`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Mar`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Apr`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-May`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Jun`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Jul`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Aug`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Sep`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Oct`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Nov`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `11-Dec`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Jan`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Feb`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Mar`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Apr`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-May`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Jun`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Jul`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Aug`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Sep`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Oct`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Nov`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `12-Dec`  <dbl> NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, NA, 0, NA, N…
$ `13-Jan`  <dbl> NA, NA, 0, NA, NA, 445387, NA, NA, 734161, NA, NA, 35450, NA…
$ `13-Feb`  <dbl> NA, NA, 0, NA, NA, 409590, NA, NA, 685919, NA, NA, 32177, NA…
$ `13-Mar`  <dbl> NA, NA, 0, NA, NA, 446587, NA, NA, 789018, NA, NA, 40849, NA…
$ `13-Apr`  <dbl> NA, NA, 0, NA, NA, 313901, NA, NA, 509878, NA, NA, 25757, NA…
$ `13-May`  <dbl> NA, NA, 0, NA, NA, 294959, NA, NA, 488771, NA, NA, 27031, NA…
$ `13-Jun`  <dbl> NA, NA, 0, NA, NA, 371677, NA, NA, 640080, NA, NA, 39493, NA…
$ `13-Jul`  <dbl> NA, NA, 0, NA, NA, 311436, NA, NA, 466612, NA, NA, 30014, NA…
$ `13-Aug`  <dbl> NA, NA, 0, NA, NA, 342033, NA, NA, 598391, NA, NA, 37560, NA…
$ `13-Sep`  <dbl> NA, NA, 0, NA, NA, 386121, NA, NA, 639250, NA, NA, 39276, NA…
$ `13-Oct`  <dbl> NA, NA, 0, NA, NA, 285165, NA, NA, 410131, NA, NA, 26281, NA…
$ `13-Nov`  <dbl> NA, NA, 0, NA, NA, 301804, NA, NA, 556360, NA, NA, 30080, NA…
$ `13-Dec`  <dbl> NA, NA, 0, NA, NA, 508148, NA, NA, 984537, NA, NA, 67193, NA…
$ `14-Jan`  <dbl> NA, NA, 0, NA, NA, 278061, NA, NA, 427590, NA, NA, 23758, NA…
$ `14-Feb`  <dbl> NA, NA, 0, NA, NA, 310467, NA, NA, 498764, NA, NA, 39006, NA…
$ `14-Mar`  <dbl> NA, NA, 0, NA, NA, 358239, NA, NA, 575448, NA, NA, 41006, NA…
$ `14-Apr`  <dbl> NA, NA, 0, NA, NA, 248998, NA, NA, 378067, NA, NA, 24468, NA…
$ `14-May`  <dbl> NA, NA, 0, NA, NA, 232080, NA, NA, 381543, NA, NA, 23331, NA…
$ `14-Jun`  <dbl> NA, NA, 0, NA, NA, 302672, NA, NA, 481508, NA, NA, 37436, NA…
$ `14-Jul`  <dbl> NA, NA, 0, NA, NA, 267015, NA, NA, 390322, NA, NA, 28907, NA…
$ `14-Aug`  <dbl> NA, NA, 0, NA, NA, 322004, NA, NA, 474441, NA, NA, 34107, NA…
$ `14-Sep`  <dbl> NA, NA, 30, NA, NA, 374625, NA, NA, 542372, NA, NA, 34987, N…
$ `14-Oct`  <dbl> NA, NA, 0, NA, NA, 297737, NA, NA, 394104, NA, NA, 19505, NA…
$ `14-Nov`  <dbl> NA, NA, 0, NA, NA, 452887, NA, NA, 442740, NA, NA, 25061, NA…
$ `14-Dec`  <dbl> NA, NA, 5, NA, NA, 705445, NA, NA, 813362, NA, NA, 65220, NA…
$ `15-Jan`  <dbl> NA, NA, 0, NA, NA, 258244, NA, NA, 395002, NA, NA, 25691, NA…
$ `15-Feb`  <dbl> NA, NA, 0, NA, NA, 298029, NA, NA, 470135, NA, NA, 27354, NA…
$ `15-Mar`  <dbl> NA, NA, 570, NA, NA, 339785, NA, NA, 464293, NA, NA, 28577, …
$ `15-Apr`  <dbl> NA, NA, 2061, NA, NA, 233427, NA, NA, 348825, NA, NA, 21515,…
$ `15-May`  <dbl> NA, NA, 13822, NA, NA, 224777, NA, NA, 327153, NA, NA, 22181…
$ `15-Jun`  <dbl> NA, NA, 16730, NA, NA, 291458, NA, NA, 388646, NA, NA, 23848…
$ `15-Jul`  <dbl> NA, NA, 13178, NA, NA, 241622, NA, NA, 311547, NA, NA, 30147…
$ `15-Aug`  <dbl> NA, NA, 9814, NA, NA, 293511, NA, NA, 387380, NA, NA, 24904,…
$ `15-Sep`  <dbl> NA, NA, 10166, NA, NA, 356579, NA, NA, 448017, NA, NA, 25861…
$ `15-Oct`  <dbl> NA, NA, 6495, NA, NA, 264255, NA, NA, 294690, NA, NA, 18881,…
$ `15-Nov`  <dbl> NA, NA, 27470, NA, NA, 337553, NA, NA, 366460, NA, NA, 25321…
$ `15-Dec`  <dbl> NA, NA, 57135, NA, NA, 679924, NA, NA, 715109, NA, NA, 48449…
$ `16-Jan`  <dbl> NA, NA, 24230, NA, NA, 289284, NA, NA, 305632, NA, NA, 20415…
$ `16-Feb`  <dbl> NA, NA, 22576, NA, NA, 319954, NA, NA, 315760, NA, NA, 23348…
$ `16-Mar`  <dbl> NA, NA, 25627, NA, NA, 351710, NA, NA, 369734, NA, NA, 27214…
$ `16-Apr`  <dbl> NA, NA, 21283, NA, NA, 253872, NA, NA, 261899, NA, NA, 18392…
$ `16-May`  <dbl> NA, NA, 21486, NA, NA, 261584, NA, NA, 239278, NA, NA, 23174…
$ `16-Jun`  <dbl> NA, NA, 31879, NA, NA, 340386, NA, NA, 297683, NA, NA, 20951…
$ `16-Jul`  <dbl> NA, NA, 25246, NA, NA, 275873, NA, NA, 250246, NA, NA, 17712…
$ `16-Aug`  <dbl> NA, NA, 27515, NA, NA, 332474, NA, NA, 292801, NA, NA, 18621…
Per-column NA counts (188 in each of the 72 columns):
Category1 Category2 Category3    10-Dec    11-Jan    11-Feb    11-Mar    11-Apr 
      188       188       188       188       188       188       188       188 
   11-May    11-Jun    11-Jul    11-Aug    11-Sep    11-Oct    11-Nov    11-Dec 
      188       188       188       188       188       188       188       188 
   12-Jan    12-Feb    12-Mar    12-Apr    12-May    12-Jun    12-Jul    12-Aug 
      188       188       188       188       188       188       188       188 
   12-Sep    12-Oct    12-Nov    12-Dec    13-Jan    13-Feb    13-Mar    13-Apr 
      188       188       188       188       188       188       188       188 
   13-May    13-Jun    13-Jul    13-Aug    13-Sep    13-Oct    13-Nov    13-Dec 
      188       188       188       188       188       188       188       188 
   14-Jan    14-Feb    14-Mar    14-Apr    14-May    14-Jun    14-Jul    14-Aug 
      188       188       188       188       188       188       188       188 
   14-Sep    14-Oct    14-Nov    14-Dec    15-Jan    15-Feb    15-Mar    15-Apr 
      188       188       188       188       188       188       188       188 
   15-May    15-Jun    15-Jul    15-Aug    15-Sep    15-Oct    15-Nov    15-Dec 
      188       188       188       188       188       188       188       188 
   16-Jan    16-Feb    16-Mar    16-Apr    16-May    16-Jun    16-Jul    16-Aug 
      188       188       188       188       188       188       188       188 
[1] 13536 (total NA cells: 188 separator rows × 72 columns)
Combinations and Counts
Category1 Category2 Category3 n
NA NA NA 188
A J NULL 5
A J O 5
A C NULL 4
B C W 3
C C W 3
A C W 2
A A A 1
A A B 1
A A C 1
A A E 1
A A I 1
A A M 1
A A V 1
A A X 1
A B P 1
A B Q 1
A B R 1
A D NULL 1
A E U 1
A F L 1
A F T 1
A G D 1
A G F 1
A G H 1
A G J 1
A H A3 1
A I A2 1
A I G 1
A I K 1
A I NULL 1
A J A1 1
A J Z 1
A X W 1
B A A 1
B A B 1
B A C 1
B A E 1
B A I 1
B A N 1
B A V 1
B A Y 1
B B P 1
B B Q 1
B B R 1
B E U 1
B F L 1
B F S 1
B F T 1
B G D 1
B G F 1
B G H 1
B G J 1
B I A2 1
B I G 1
B I K 1
B J NULL 1
C A A 1
C A B 1
C A C 1
C A E 1
C A I 1
C A V 1
C A Y 1
C B P 1
C B Q 1
C B R 1
C E U 1
C F L 1
C F T 1
C G D 1
C G F 1
C G H 1
C G J 1
C I A2 1
C I G 1
C I K 1
C J NULL 1
C J Z 1

2.2 Wrangle the Data

(Code and output hidden; see .rmd for code)
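
A minimal sketch of the hidden wrangling step, assuming tidyr/dplyr and a raw object named `raw_data` (282 rows minus 188 all-NA separator rows leaves 94 series, and 94 × 69 monthly columns = 6,486 rows):

library(dplyr)
library(tidyr)

data_wrangled <- raw_data %>%
  filter(!is.na(Category1)) %>%                     # drop the all-NA separator rows
  pivot_longer(cols = -c(Category1, Category2, Category3),
               names_to = "Month", values_to = "Value") %>%
  # "10-Dec" parses to 2010-12-01; %b assumes an English locale
  mutate(Date = as.Date(paste0(Month, "-01"), format = "%y-%b-%d")) %>%
  select(Category1, Category2, Category3, Date, Value)

glimpse(data_wrangled)
cat("Number of NA Dates:", sum(is.na(data_wrangled$Date)), "\n")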

Rows: 6,486
Columns: 5
$ Category1 <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", …
$ Category2 <chr> "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", …
$ Category3 <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", …
$ Date      <date> 2010-12-01, 2011-01-01, 2011-02-01, 2011-03-01, 2011-04-01,…
$ Value     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Number of NA Dates: 0 

2.3 Final EDA

(Code and output hidden; see .rmd for code)
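
The hidden chunk presumably computes each combination's share of total value, which drives the selection in 2.4; a sketch under that assumption:

# Share of total Value by category combination (uses data_wrangled from 2.2)
value_shares <- data_wrangled %>%
  group_by(Category1, Category2, Category3) %>%
  summarise(Total = sum(Value, na.rm = TRUE), .groups = "drop") %>%
  mutate(Share = 100 * Total / sum(Total)) %>%
  arrange(desc(Share))

head(value_shares, 5)         # the top combinations quoted below
sum(value_shares$Total)       # dataset total value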

2.4 EDA Analysis

Following the Exploratory Data Analysis (EDA), I plan to create time series objects from the following Category Code combinations:

  • A, A, M
  • B, C, W
  • C, C, W

My rationale:

  • Coverage: These three combinations account for \(\approx 15.11\%\) (A, A, M), \(\approx 11.68\%\) (B, C, W), and \(\approx 11.48\%\) (C, C, W) of the dataset’s total value of USD 5,659,237,273, or roughly \(38.77\%\) combined.
  • Representation: They include one series from each Category1 value (A, B, and C).

Given their significant share of the total value, I believe these combinations offer the best overall basis for predicting future values.

3 Time Series Model Selection

3.1 Check for Date or DF

# Assign data to ts_data (assuming 'data_wrangled' is already prepared)
ts_data <- data_wrangled

# Data Validation: Ensure essential structure for time series analysis
if (!is.data.frame(ts_data) || !any(tolower(names(ts_data)) == "date")) {
  stop("Error: 'ts_data' is not a dataframe or 'date' column does not exist.") # Stop execution if the data is invalid
}

# Confirmation Message (if the code reaches here, validation checks have passed)
print("Check passed: 'ts_data' is a dataframe and contains a 'date' column.")
[1] "Check passed: 'ts_data' is a dataframe and contains a 'date' column."

3.2 Convert Column(s) to TS

# Assumptions (Make sure these align with your data)
# - 'ts_data' is loaded with columns: Category1, Category2, Category3, Date, Value
# - Your goal is to create separate time series for different category combinations

# Parameters 
start_date <- as.Date("2013-01-01") 
end_date <- as.Date("2016-08-31") 
frequency <- 12 # Monthly data (shadows stats::frequency; calls like frequency(x) still find the function)

# Predefined Category Combinations 
categories <- list(
  c("A", "A", "M"), 
  c("B", "C", "W"),
  c("C", "C", "W")
)

# Storage for Time Series and Output
ts_list <- list()     # Stores time series objects
output_text <- list() # Stores text output for reporting

# Category Conversion and Verification Loop
for (i in seq_along(categories)) {
  category <- categories[[i]]

  # Filtering and Aggregation
  filtered_data <- ts_data %>%
    filter(Category1 == category[1], Category2 == category[2], Category3 == category[3],
           Date >= start_date, Date <= end_date) %>%
    group_by(Date) %>%
    summarise(Value = sum(Value)) %>%   # Ensure 'sum' is your intended aggregation 
    arrange(Date) 

  # Time Series Creation 
  ts_object <- ts(filtered_data$Value, 
                  start = c(year(min(filtered_data$Date)), month(min(filtered_data$Date))),
                  frequency = frequency)

  # Store Time Series with Descriptive Name
  ts_list_name <- paste(category, collapse = "_")
  ts_list[[ts_list_name]] <- ts_object

  # Data Consistency Checks (with formatted output)
  output_text[[ts_list_name]] <- capture.output({
    formatted_start <- format(min(filtered_data$Date), "%Y-%m-%d")
    formatted_end <- format(max(filtered_data$Date), "%Y-%m-%d")

    cat("\n---------------------------------------------\n")
    cat("Category combination: ", ts_list_name, "\n\n")
    cat("Summary for this category:\n")
    cat("- Date range in data: [", formatted_start, " - ", formatted_end, "]\n")
    cat("- Time series length: ", length(ts_object), "\n")
    cat("- Expected periods (unique dates): ", length(unique(filtered_data$Date)), "\n")
    cat("- Data points used: ", nrow(filtered_data), "\n")
    cat("---------------------------------------------\n")
  })
}

# Display Verification Output 
for (name in names(output_text)) {
  cat("\nOutput for category combination:", name, "\n")
  cat(output_text[[name]], sep="\n")
}

Output for category combination: A_A_M 

---------------------------------------------
Category combination:  A_A_M 

Summary for this category:
- Date range in data: [ 2013-01-01  -  2016-08-01 ]
- Time series length:  44 
- Expected periods (unique dates):  44 
- Data points used:  44 
---------------------------------------------

Output for category combination: B_C_W 

---------------------------------------------
Category combination:  B_C_W 

Summary for this category:
- Date range in data: [ 2013-01-01  -  2016-08-01 ]
- Time series length:  44 
- Expected periods (unique dates):  44 
- Data points used:  44 
---------------------------------------------

Output for category combination: C_C_W 

---------------------------------------------
Category combination:  C_C_W 

Summary for this category:
- Date range in data: [ 2013-01-01  -  2016-08-01 ]
- Time series length:  44 
- Expected periods (unique dates):  44 
- Data points used:  44 
---------------------------------------------

3.3 Plot - Original Time Series

# Assumptions
# - 'start_date' has been defined previously 
# - 'ts_list' contains a list of your time series objects

# Store Plots for Later Use 
time_series_plots_list <- list() 

# Create and Display Plots 
for (name in names(ts_list)) {
  ts_object <- ts_list[[name]] 

  # Generate Date Sequence for Consistent Plotting 
  date_seq <- seq(start_date, by = "month", length.out = length(ts_object))

  # Create ggplot2 Time Series Plot
  plot <- ggplot(data.frame(Time = date_seq, Value = as.numeric(ts_object)), aes(x = Time, y = Value)) +
    geom_line() +
    labs(title = paste("Time Series for:", name), 
         x = "Time", 
         y = "Value") +  
    theme_minimal(base_size = 14) + 
    theme(plot.title = element_text(size = 20, hjust = 0.5, face = "bold"), 
          axis.text.x = element_text(angle = 45, hjust = 1),  
          axis.title.x = element_text(size = 14, face = "bold"), 
          axis.title.y = element_text(size = 14, face = "bold")) 

  # Store and Print Plot 
  time_series_plots_list[[name]] <- plot 
  print(plot) 
}

3.4 Plot - STL Decomposition

# Store STL Plots for Later Use
stl_plots_list <- list() 

# STL Decomposition and Plotting
for (name in names(ts_list)) {
  ts_object <- ts_list[[name]] 

  # Decompose Time Series (STL)
  ts_stl <- stl(ts_object, s.window = "periodic", robust = TRUE) 

  # Create STL Plot with Enhanced Formatting  
  stl_plot <- autoplot(ts_stl) +
    labs(title = paste("STL Decomposition for:", name), 
         x = "Time", 
         y = "Value") +  
    theme_minimal(base_size = 14) + 
    theme(plot.title = element_text(size = 20, hjust = 0.5, face = "bold"), 
          axis.text.x = element_text(angle = 45, hjust = 1),  
          axis.title.x = element_text(size = 14, face = "bold"), 
          axis.title.y = element_text(size = 14, face = "bold"), 
          strip.text.x = element_text(size = 16, face = "bold"), 
          strip.background = element_rect(fill = "lightblue", colour = "deepskyblue", size = 1)) 

  # Store and Print
  stl_plots_list[[name]] <- stl_plot  
  print(stl_plot) 
}
(Figures: STL decompositions for A_A_M, B_C_W, and C_C_W, each displaying four panels: the observed data, trend, seasonal, and remainder components.)

3.5 Check Seasonality Levels

# Store Seasonality Assessment Results
seasonality_output <- list()

# Analyze Seasonality of Time Series 
for (name in names(ts_list)) {
  ts_object <- ts_list[[name]]

  # Decompose Time Series (STL)
  stl_object <- stl(ts_object, s.window = "periodic") 

  # Measure Seasonality Strength (MAD)
  # Note: this MAD is in the units of the data, so for sales values in the
  # millions the 0.1/0.2 thresholds below are always exceeded; a scale-free
  # measure (e.g. MAD / mean(ts_object)) would be more informative
  seasonal_comp <- stl_object$time.series[, "seasonal"] 
  seasonal_mad <- mean(abs(seasonal_comp - mean(seasonal_comp))) 

  # Interpret Seasonality Strength  
  seasonality_assessment <- if (seasonal_mad > 0.2) {
    "The time series exhibits significant seasonality.\n" 
  } else if (seasonal_mad > 0.1) {
    "The time series exhibits some seasonality.\n"
  } else {
    "The time series likely does not exhibit significant seasonality.\n" 
  }

  # Store Seasonality Analysis Report
  seasonality_output[[name]] <- capture.output({
    cat("\n---------------------------------------------\n")
    cat("Time Series Analysis: ", name, "\n\n")
    cat("Seasonality Assessment Summary:\n")
    cat(sprintf("Mean Absolute Deviation (MAD) of the seasonal component: %.2f\n", seasonal_mad))
    cat(seasonality_assessment)
    cat("---------------------------------------------\n")
  })
}

# Display Assessment Reports
for (name in names(seasonality_output)) {
  cat("\nOutput for Time Series:", name, "\n")
  cat(seasonality_output[[name]], sep="\n") 
}

Output for Time Series: A_A_M 

---------------------------------------------
Time Series Analysis:  A_A_M 

Seasonality Assessment Summary:
Mean Absolute Deviation (MAD) of the seasonal component: 3878397.13
The time series exhibits significant seasonality.
---------------------------------------------

Output for Time Series: B_C_W 

---------------------------------------------
Time Series Analysis:  B_C_W 

Seasonality Assessment Summary:
Mean Absolute Deviation (MAD) of the seasonal component: 2342388.23
The time series exhibits significant seasonality.
---------------------------------------------

Output for Time Series: C_C_W 

---------------------------------------------
Time Series Analysis:  C_C_W 

Seasonality Assessment Summary:
Mean Absolute Deviation (MAD) of the seasonal component: 2593021.80
The time series exhibits significant seasonality.
---------------------------------------------
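
Because the absolute MAD is scale-dependent, a scale-free check is a useful complement. The seasonal-strength statistic from Hyndman and Athanasopoulos (values near 1 indicate strong seasonality) was not part of the original analysis, but a sketch would be:

# Seasonal strength: Fs = max(0, 1 - Var(remainder) / Var(seasonal + remainder))
for (name in names(ts_list)) {
  comp <- stl(ts_list[[name]], s.window = "periodic")$time.series
  fs <- max(0, 1 - var(comp[, "remainder"]) / var(comp[, "seasonal"] + comp[, "remainder"]))
  cat(sprintf("%s: seasonal strength = %.2f\n", name, fs))
}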

3.6 Differencing Tests

# Store Results for Later 
differencing_output <- list()
differenced_ts_list <- list()

# Differencing Analysis Loop
for (name in names(ts_list)) {
  # Start with a Copy of Original Data
  current_data <- ts_list[[name]] 
  max_iterations <- 5  # Set a maximum for differencing attempts
  iterations <- 0
  seasonal_period <- frequency

  # Store Differencing Test Report
  differencing_output[[name]] <- capture.output({
    cat("\n---------------------------------------------\n")
    cat("Performing Differencing Tests for: ", name, "\n\n")

    # Seasonal Differencing (if applicable)
    if (seasonal_period > 1) {
      current_data <- diff(current_data, lag = seasonal_period) 
      iterations <- iterations + 1
      cat("Seasonal differencing applied with lag =", seasonal_period, "\n")

      # Update ts object after seasonal differencing (kept anchored at the
      # original start for simplicity, although differencing drops the first
      # `lag` observations)
      current_data <- ts(current_data, start = start(ts_list[[name]]), frequency = frequency) 
    }

    # Iterative Regular Differencing
    while (iterations < max_iterations) {
      adf_result <- adf.test(current_data, alternative = "stationary") 

      # Stop if stationary 
      if (adf_result$p.value < 0.05) {
        break
      } 

      # Otherwise, difference and update
      current_data <- diff(current_data) 
      iterations <- iterations + 1
      # The p-value reported here comes from the ADF test run *before* this
      # difference was applied
      cat(sprintf("After differencing %d times, p-value is %.5f \n", iterations, adf_result$p.value))

      #  Update ts object after differencing
      current_data <- ts(current_data, start = start(ts_list[[name]]), frequency = frequency) 
    }

    # Final Stationarity Assessment 
    if (adf_result$p.value < 0.05) {
      cat(sprintf("Time Series %s appears stationary after %d differencing operations.\n", name, iterations))
    } else {
      cat(sprintf("Time Series %s is still non-stationary after maximum allowed differencing operations.\n", name))
    }
    cat("\n---------------------------------------------\n")
  })

  # Store Final Differenced Data  
  differenced_ts_list[[name]] <- current_data 
}

# Display Test Reports
for (name in names(differencing_output)) {
  cat("\nDifferencing Test Output for Time Series:", name, "\n")
  cat(differencing_output[[name]], sep="\n") 
}

Differencing Test Output for Time Series: A_A_M 

---------------------------------------------
Performing Differencing Tests for:  A_A_M 

Seasonal differencing applied with lag = 12 
After differencing 2 times, p-value is 0.55791 
After differencing 3 times, p-value is 0.16701 
Time Series A_A_M appears stationary after 3 differencing operations.

---------------------------------------------

Differencing Test Output for Time Series: B_C_W 

---------------------------------------------
Performing Differencing Tests for:  B_C_W 

Seasonal differencing applied with lag = 12 
After differencing 2 times, p-value is 0.47475 
After differencing 3 times, p-value is 0.09616 
Time Series B_C_W appears stationary after 3 differencing operations.

---------------------------------------------

Differencing Test Output for Time Series: C_C_W 

---------------------------------------------
Performing Differencing Tests for:  C_C_W 

Seasonal differencing applied with lag = 12 
After differencing 2 times, p-value is 0.21465 
Time Series C_C_W appears stationary after 2 differencing operations.

---------------------------------------------
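
As a cross-check on the iterative ADF approach above (not used in the original analysis), the forecast package can estimate the required differencing orders directly:

library(forecast)

for (name in names(ts_list)) {
  D <- nsdiffs(ts_list[[name]])                 # suggested seasonal differences
  d <- ndiffs(diff(ts_list[[name]], lag = 12))  # regular differences after one seasonal diff
  cat(sprintf("%s: suggested D = %d, d = %d\n", name, D, d))
}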

3.7 Plot - Differencing

# Store Plots 
differencing_plots_list <- list()

# Differencing and Plotting Loop
for (name in names(ts_list)) {
  current_data <- ts_list[[name]]
  max_iterations <- 5  
  iterations <- 0
  seasonal_period <- frequency # Assuming 'frequency' is defined earlier 

  # Seasonal Differencing (if needed)
  if (seasonal_period > 1) {
    current_data <- diff(current_data, lag = seasonal_period)
    iterations <- 1 
  }

  # Iterative Regular Differencing 
  while (iterations < max_iterations) {
    adf_result <- adf.test(current_data, alternative = "stationary") 
    if (adf_result$p.value < 0.05) { 
      break 
    } 
    current_data <- diff(current_data) 
    iterations <- iterations + 1 
  }

  # Prepare for Plotting (the date axis is anchored at start_date for
  # simplicity, although differencing drops the first observations)
  date_seq <- seq(start_date, by = "month", length.out = length(current_data)) 

  # Create and Store Differenced Time Series Plot
  plot <- ggplot(data.frame(Date = date_seq, Value = as.numeric(current_data)), aes(x = Date, y = Value)) +
    geom_line() +
    labs(title = paste("Time Series after", iterations, "Differencing(s) for:", name), 
         x = "Date", 
         y = "Value") +  
    theme_minimal() + 
    theme(plot.title = element_text(hjust = 0.5)) 

  differencing_plots_list[[name]] <- plot 
  print(plot) 
}
(Figures: the final state of the A_A_M, B_C_W, and C_C_W series after applying the necessary differencing operations.)

3.8 Augmented Dickey-Fuller test

# Assuming ts_list contains your time series objects
for (name in names(ts_list)) {
  cat("\n---------------------------------------------\n")
  cat(sprintf("Augmented Dickey-Fuller Test for: %s\n", name)) 

  current_data <- ts_list[[name]] 

  # Perform the ADF Test
  adf_result <- adf.test(current_data, alternative = "stationary") 

  # Explain Results Clearly
  cat(sprintf("ADF Test Results for %s:\n", name))
  cat(sprintf("Test Statistic: %.4f, P-value: %.4f\n", adf_result$statistic, adf_result$p.value))
  # Note: tseries::adf.test() does not return critical values, so only the
  # test statistic and p-value are reported

  # Interpret the Results (with guidance)
  if (adf_result$p.value < 0.05) {
    cat("Conclusion: The time series appears to be stationary.\n")  
  } else {
    cat("Conclusion: The time series may still be non-stationary. Consider differencing to achieve stationarity.\n") 
  }
  cat("\n---------------------------------------------\n")
}

---------------------------------------------
Augmented Dickey-Fuller Test for: A_A_M
ADF Test Results for A_A_M:
Test Statistic: -3.7219, P-value: 0.0346
Conclusion: The time series appears to be stationary.

---------------------------------------------

---------------------------------------------
Augmented Dickey-Fuller Test for: B_C_W
ADF Test Results for B_C_W:
Test Statistic: -2.4227, P-value: 0.4064
Conclusion: The time series may still be non-stationary. Consider differencing to achieve stationarity.

---------------------------------------------

---------------------------------------------
Augmented Dickey-Fuller Test for: C_C_W
ADF Test Results for C_C_W:
Test Statistic: -3.7192, P-value: 0.0348
Conclusion: The time series appears to be stationary.

---------------------------------------------

3.9 Box-Ljung Test

# Loop through time series in your list
for (name in names(ts_list)) {
  cat(sprintf("Box-Ljung Test for: %s\n", name)) 

  # Extract Time Series Data
  current_data <- ts_list[[name]] 

  # Time Series Frequency (frequency() on a ts returns its set frequency,
  # 12 here; the NA check below is purely defensive)
  numeric_frequency_estimate <- frequency(current_data) 
  if (is.na(numeric_frequency_estimate)) {
    stop("Numeric frequency estimation failed. Please check the calculations.") 
  }

  # Determine Lag for Test (adjustable rule)
  lag_for_test <- max(1, min(20, numeric_frequency_estimate)) # Adjust rule if needed
  
  # Perform Box-Ljung Test for autocorrelation (applied here to the raw
  # series; applied to model residuals it would serve as a residual diagnostic)
  box_test_result <- Box.test(current_data, lag = lag_for_test, type = "Ljung-Box") 

  # Display Test Results
  print(box_test_result) 

  cat("\n---------------------------------------------\n") 
}
Box-Ljung Test for: A_A_M

    Box-Ljung test

data:  current_data
X-squared = 42.442, df = 12, p-value = 2.806e-05


---------------------------------------------
Box-Ljung Test for: B_C_W

    Box-Ljung test

data:  current_data
X-squared = 82.307, df = 12, p-value = 1.495e-12


---------------------------------------------
Box-Ljung Test for: C_C_W

    Box-Ljung test

data:  current_data
X-squared = 59.267, df = 12, p-value = 3.069e-08


---------------------------------------------

3.10 Plots for Model Selection

# Function to generate ACF and PACF plots for differenced data
generate_acf_pacf_plots <- function(ts_data, name) {
  # Generate ACF plot
  acf_plot <- forecast::Acf(ts_data, plot = FALSE)
  # Generate PACF plot
  pacf_plot <- forecast::Pacf(ts_data, plot = FALSE)
  
  # Convert to ggplot objects using autoplot
  acf_ggplot <- ggplot2::autoplot(acf_plot) +
    ggtitle(paste("ACF for:", name)) +
    theme_minimal()
  
  pacf_ggplot <- ggplot2::autoplot(pacf_plot) +
    ggtitle(paste("PACF for:", name)) +
    theme_minimal()
  
  return(list(acf_plot = acf_ggplot, pacf_plot = pacf_ggplot)) 
}

# Assuming 'differenced_ts_list' contains the final differenced data,
# Iterate over each time series to generate and arrange plots
for (name in names(differenced_ts_list)) {
  differenced_data <- differenced_ts_list[[name]]

  # Generate ACF and PACF plots
  acf_pacf_plots <- generate_acf_pacf_plots(differenced_data, name)

  # Arrange plots for visual comparison
  gridExtra::grid.arrange(
    acf_pacf_plots$acf_plot, 
    acf_pacf_plots$pacf_plot, 
    ncol = 2, 
    top = paste("Visual Diagnostics for Model Selection:", name)
  )
}
(Figures: ACF and PACF plots of the differenced series for A_A_M, B_C_W, and C_C_W, arranged side by side as visual diagnostics for model selection.)

3.11 Determination of Model Selection

3.11.1 A, A, M

3.11.1.1 Autocorrelation Function (ACF)

The ACF plot for the A,A,M time series shows statistically significant correlations extending out to around 12 lags, with spikes at lags 1-3, 12, 13, and 24. The periodic spikes indicate potential seasonality at a 12-month period, and the earlier spikes suggest a moving average component or a short-memory autoregressive component.

3.11.1.2 Partial Autocorrelation Function (PACF)

The PACF plot cuts off decisively after lag 1 and then mostly stays between the significance bands, oscillating around 0, which is characteristic of an AR(1) process. There is also a smaller spike at lag 12, providing further evidence of the 12-month seasonal cycle. The sharp cutoff supports the view that the ACF spikes come from a moving average process rather than a higher-order AR.

3.11.1.3 Differencing

The Augmented Dickey-Fuller test showed this time series was non-stationary. After 1 seasonal difference and 2 consecutive regular differences (3 differencing operations in total), stationarity was achieved. This tells us:

  • There was a non-stationary long-term trend that was removed by the regular differencing, supporting an AR and/or MA component(s).
  • A seasonal difference was needed, confirming the 12-month seasonal pattern.

3.11.1.4 Seasonality

The seasonality diagnostic gave a very high Mean Absolute Deviation (MAD) value of around 3.9 million for the seasonal component. This supports the ACF/PACF indications that there is a strong seasonal effect at 12 months.

3.11.1.5 Final Model Selection

Based on putting all the above analyses together:

  • The PACF cutoff supports an AR(1) process
  • The three differencing operations needed suggest including an MA(3)
  • The seasonal PACF spike and the seasonality check indicate adding a SAR(1) at 12 months
  • No support was found for a seasonal MA, so SMA is left at 0

Therefore, I would start with a SARIMA(1,1,3)(1,1,0)[12] model to forecast this time series.

3.11.2 B, C, W

3.11.2.1 Autocorrelation Function (ACF)

The ACF plot for the B,C,W time series shows statistically significant correlations extending out to around lags 12-15, with additional spikes at lags 1-3. The periodic seasonal spikes indicate a potential 12-month seasonal cycle, and the earlier spikes suggest a moving average component or a short-memory autoregressive component.

3.11.2.2 Partial Autocorrelation Function (PACF)

The PACF plot cuts off decisively after lag 1 and then oscillates around zero between the significance bands, characteristic of an AR(1) process. There is also a smaller spike at lag 12, providing further evidence of the 12-month seasonal pattern. The sharp cutoff supports the view that the ACF spikes come from an MA process rather than a higher-order AR model.

3.11.2.3 Differencing

The Augmented Dickey-Fuller test showed this time series was initially non-stationary. After applying 1 seasonal difference and 2 regular differences (3 differencing operations in total), stationarity was achieved. This tells us:

  • There was a non-stationary trend that was eliminated by the regular differencing, providing support for an AR and/or MA component(s).
  • A seasonal difference was required, confirming the seasonal behavior at 12 months.

3.11.2.4 Seasonality

The seasonality check gave a high Mean Absolute Deviation (MAD) of around 2.3 million for the seasonal component, supporting the hypothesis of seasonality at an annual periodicity.

3.11.2.5 Final Model Selection

Based on the above:

  • The PACF supports including an AR(1) term
  • The three differencing operations needed indicate adding an MA(3)
  • The seasonal ACF/PACF points to adding a SAR(1) at 12 months
  • No evidence was found for a seasonal MA, leaving SMA at 0

Therefore, I would start by fitting a SARIMA(1,1,3)(1,1,0)[12] model to this time series.

3.11.3 C, C, W

3.11.3.1 Autocorrelation Function (ACF)

The ACF plot for the C,C,W series shows statistically significant correlations extending out to lags around 12-15. There are additional smaller spikes at lags 1-5. The periodic seasonal spikes indicate a potential seasonal cycle at 12 months. The earlier spikes suggest there could be a higher order autoregressive process.

3.11.3.2 Partial Autocorrelation Function (PACF)

The PACF plot shows a slow decay in the correlations without clearly cutting off. This suggests the ACF spikes are from a higher order AR rather than an MA process. There is also a spike at lag 12 relating to the seasonal pattern.

3.11.3.3 Differencing

The Augmented Dickey-Fuller test showed this series was initially non-stationary. After applying 1 seasonal difference and 1 regular difference (2 differencing operations in total), stationarity was achieved. This tells us:

  • A non-stationary trend was removed by regular differencing, supporting an AR/MA component(s).
  • The seasonal differencing confirms the 12 month seasonal cycle.

3.11.3.4 Seasonality

The seasonality check gave a high Mean Absolute Deviation (MAD) of 2.6 million for the seasonal component, pointing to a strong seasonal effect.

3.11.3.5 Final Model Selection

Based on the analyses:

  • The slowly decaying PACF indicates an AR(2) term
  • The two differencing operations needed indicate adding an MA(2)
  • The ACF and seasonal PACF spikes point to a SAR(1)[12]
  • No support was found for a seasonal MA, leaving SMA at 0

Therefore, I would start by fitting a SARIMA(2,1,2)(1,1,0)[12] model to this time series.
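
Fitting the three proposed specifications with forecast::Arima would look like the following sketch (assuming ts_list from Section 3.2). Note that the model summaries printed in Section 4.2 correspond to simpler models, so the orders appear to have been revised (or selected automatically) at the forecasting stage.

library(forecast)

proposed <- list(
  A_A_M = c(1, 1, 3),
  B_C_W = c(1, 1, 3),
  C_C_W = c(2, 1, 2)
)

fits <- lapply(names(proposed), function(name) {
  Arima(ts_list[[name]],
        order = proposed[[name]],
        seasonal = list(order = c(1, 1, 0), period = 12))
})
names(fits) <- names(proposed)
lapply(fits, summary)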

4 Forecasting

4.1 Reminder of the Variables Selected for Forecasting

(Code and output hidden; see .rmd for code)
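
The hidden chunk presumably just re-lists the selected series, e.g.:

names(ts_list)  # "A_A_M" "B_C_W" "C_C_W"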

4.2 Forecasting Plots and Prints
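
(The forecasting code is hidden in the .rmd. Below is a sketch of a pipeline that would produce the printed summaries, forecast plots, and the submission csv; whether the printed models came from auto.arima or from revised manual orders is not shown, and `forecasts.csv` is a placeholder name.)

library(forecast)

forecast_rows <- list()
for (name in names(ts_list)) {
  ts_data <- ts_list[[name]]        # the "Series: ts_data" label below reflects this name
  fit <- auto.arima(ts_data)        # assumption: the printed models look auto-selected
  print(summary(fit))               # model summaries as printed below
  fc <- forecast(fit, h = 12)       # 12-month-ahead forecast
  print(autoplot(fc) + ggtitle(paste("12-Month Forecast for:", name)))
  forecast_rows[[name]] <- data.frame(Series   = name,
                                      Period   = as.numeric(time(fc$mean)),
                                      Forecast = as.numeric(fc$mean))
}
write.csv(do.call(rbind, forecast_rows), "forecasts.csv", row.names = FALSE)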

Series: ts_data 
ARIMA(1,1,0)(0,1,0)[12] 

Coefficients:
          ar1
      -0.7621
s.e.   0.1673

sigma^2 = 1.177e+13:  log likelihood = -279.86
AIC=563.73   AICc=564.58   BIC=565.39

Training set error measures:
                    ME    RMSE     MAE     MPE     MAPE      MASE       ACF1
Training set -281927.6 2505690 1602520 223.335 323.0029 0.6869274 -0.3277715

Series: ts_data 
ARIMA(1,1,0)(0,1,0)[12] 

Coefficients:
          ar1
      -0.7782
s.e.   0.1702

sigma^2 = 4.501e+12:  log likelihood = -271.72
AIC=547.44   AICc=548.3   BIC=549.11

Training set error measures:
                    ME    RMSE      MAE      MPE     MAPE      MASE       ACF1
Training set -24968.57 1549353 905330.8 67.48467 77.38088 0.6973285 -0.2433445

Series: ts_data 
ARIMA(0,1,1)(0,1,0)[12] 

Coefficients:
          ma1
      -0.8700
s.e.   0.1799

sigma^2 = 9.741e+11:  log likelihood = -274.17
AIC=552.35   AICc=553.15   BIC=554.13

Training set error measures:
                   ME     RMSE      MAE      MPE     MAPE      MASE       ACF1
Training set 155812.9 730871.4 446412.1 211.4865 226.9794 0.5925095 0.01513287

5 Summary of Assignment

5.1 Stats/Forecasting Summary

5.1.1 A,A,M Series

  • Model: I proposed a SARIMA(1,1,3)(1,1,0)[12] model; the fitted model printed in Section 4.2 for this series is ARIMA(1,1,0)(0,1,0)[12].
  • Stationarity: The Augmented Dickey-Fuller test indicated stationarity after 1 seasonal difference and 2 regular differences (3 differencing operations in total).
  • Seasonality: A very high Mean Absolute Deviation (MAD) of around 3.9 million for the seasonal component was observed, suggesting significant seasonality at 12 months.
  • Error Metrics: The model exhibited high prediction errors on the training set, with an RMSE of 2,505,690 and a MAPE of 323.00%, indicating potential inadequacy in capturing the data’s patterns.

5.1.2 B,C,W Series

  • Model: I proposed a SARIMA(1,1,3)(1,1,0)[12] model; the fitted model printed in Section 4.2 for this series is ARIMA(1,1,0)(0,1,0)[12].
  • Stationarity: After 1 seasonal difference and 2 regular differences (3 differencing operations in total), the series was deemed stationary, although one of the reported p-values (0.096) was marginally above 0.05, hinting at possible residual non-stationarity.
  • Seasonality: The series exhibited a high MAD of around 2.3 million for the seasonal component, confirming annual seasonality.
  • Error Metrics: The RMSE was 1,549,353 and the MAPE 77.38%, lower than for the A,A,M series but still indicating substantial prediction error.

5.1.3 C,C,W Series

  • Model: I proposed a SARIMA(2,1,2)(1,1,0)[12] model; the fitted model printed in Section 4.2 for this series is ARIMA(0,1,1)(0,1,0)[12].
  • Stationarity: The series achieved stationarity after 1 seasonal difference and 1 regular difference (2 differencing operations in total), as per the Augmented Dickey-Fuller test.
  • Seasonality: A high MAD for the seasonal component of 2.6 million was observed, pointing to a strong seasonal effect.
  • Error Metrics: The model reported an RMSE of 730,871 and a MAPE of 226.98%, suggesting a better fit than the A,A,M series but still relatively high errors.

5.1.4 General Summary

The forecasting models for all three time series showed high prediction errors on the training data, which suggests that the models may not be capturing the underlying patterns effectively. The significant seasonality detected in all series is addressed through seasonal differencing, yet the high error metrics imply that further refinement, possibly through parameter tuning or the inclusion of additional explanatory variables, could improve forecast accuracy. These findings call for a cautious approach to relying on the forecasts without further model improvement and validation against unseen data.

5.2 For Future Improvement

The time series analysis conducted on the A_A_M, B_C_W, and C_C_W series offered valuable insights into the underlying patterns and potential for forecasting. However, there are several avenues for improvement that could enhance the reliability and accuracy of future models:

  1. Model Optimization: While the SARIMA models provided a starting point for analysis, further refinement is necessary. The high RMSE and MAPE values indicate that the models may not be capturing the underlying process adequately. More sophisticated model selection techniques or additional explanatory variables could improve performance.

  2. Data Wrangling Efficiency: Given time constraints, utilizing auto.arima can streamline the model selection process after initial data wrangling. This function can automate the identification of optimal model parameters, saving valuable time and computational resources.

  3. Time Series Duration: The current dataset spans four years, which may not capture longer-term cyclical behavior or structural changes in the data. Time series analysis often benefits from longer periods to discern between random fluctuations and true patterns. Future studies should consider extending the timeframe if data availability allows.

  4. Additional Data: Incorporating more granular data or external variables could help to explain some of the variance not accounted for by the time series models alone. Economic indicators, market trends, or categorical events could provide further context for the fluctuations observed in the series.

  5. Model Diagnostics: Post-modeling diagnostics play a critical role in validating the assumptions of the time series models. Checks for autocorrelation, non-normality, and heteroscedasticity in the residuals can signal the need for model adjustments or additional differencing.

  6. Forecasting Evaluation: The accuracy of forecasts should be evaluated against a holdout sample or through time series cross-validation to provide a more robust measure of the model’s predictive capabilities (see the tsCV sketch after this list).

  7. Seasonality Adjustments: The significant seasonality indicated by the Mean Absolute Deviation (MAD) in the series underscores the need to refine how seasonal effects are accounted for, possibly through more complex seasonal models or transformation techniques.

  8. Training Data Concerns: The high training errors observed suggest the models may not generalize well to unseen data. Future models should focus on improving the fit on the training data without overfitting, possibly through regularization techniques or model averaging.

  9. Software and Tools: Upgrading to more advanced statistical software or leveraging machine learning tools may offer better functionality for modeling complex time series data and automating parts of the analytical process.

  10. Collaborative Efforts: Time series forecasting can benefit from collaborative efforts, bringing together domain experts and data scientists to ensure that models are not only statistically sound but also grounded in real-world phenomena.
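
For point 6, forecast::tsCV provides a quick form of time series cross-validation; a sketch for one series (not run in the original analysis, and slow, since it refits the model at every forecast origin):

library(forecast)

fc_fun <- function(y, h) forecast(auto.arima(y), h = h)
cv_errors <- tsCV(ts_list[["A_A_M"]], fc_fun, h = 1)   # one-step-ahead errors
sqrt(mean(cv_errors^2, na.rm = TRUE))                  # CV RMSE, comparable across models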

By addressing these areas, the predictive power and reliability of time series models used in future analyses can be significantly improved, leading to more accurate forecasts and better-informed decision-making.